Abstract

Extracting communities using existing community detection algorithms yields dense sub-networks that are
difficult to analyse. Extracting a smaller sample that embodies the relationships of a list of suspects is an 
important part of the beginning of an investigation. In this paper, we present the efficacy of our shortest 
paths network search algorithm (SPNSA) that begins with an ‘algorithm feed’, a small subset of nodes 
of particular interest, and builds an investigative sub-network. The algorithm feed may consist of known 
criminals or suspects, or persons of influence. This sets our approach apart from existing community 
detection algorithms. We apply the SPNSA on the Enron Dataset of e-mail communications starting with 
those convicted of money laundering in relation to the collapse of Enron as the algorithm feed. The algorithm 
produces sparse and small sub-networks that could feasibly identify a list of persons and relationships 
to be further investigated. In contrast, we show that identifying sub-networks of interest using either 
community detection algorithms or a k-Neighbourhood approach produces sub-networks of much larger size 
and complexity. When the 18 top managers of Enron were used as the algorithm feed, the resulting
sub-network identified 4 convicted criminals who were not managers and so not part of the algorithm feed. We
also directly tested the SPNSA by removing one of the convicted criminals from the algorithm feed and
re-running the algorithm; in 5 out of 9 cases the left-out criminal occurred in the resulting sub-network.
Keywords: Criminal network, Shortest path, Leave-one-out, Trust, Suspect, Investigation


1. Introduction 


Retrieving a criminal network from an organised crime incident is an important part of crime
investigation. This task is a difficult one, mainly because of the involvement of a variety of criminals who play
myriad roles (Basu, 2014; Didimo et al., 2011). In addition to drug trafficking and money laundering,
organised crime includes hijacking and equipment smuggling. The task of the criminal investigator is further
hampered by the mass of data that must be searched, so an important first step in an investigation is
identifying a smaller sample that embodies the relationships among the criminal participants.


In (Magalingam et al., 2014), we presented an algorithm designed to extract a smaller, more manageable
network of possible relationships from a large dataset of interactions. In this paper, we further develop
this algorithm and show that it performs well in a variety of scenarios, and is able to extract meaningful
sub-networks with which a criminal investigator can start an investigation. We show that this algorithm performs
better than known community detection algorithms (Pons, 2006; Clauset et al., 2004; Newman, 2006), as
well as k-neighbourhood detection methods (Zhou and Pei, 2011).

In the past, extracting criminal associations from raw data has required preliminary information about such
relationships, while building a network from such known relationships has been done manually (Basu, 2014;
Didimo et al., 2011; Christin et al., 2010; Oatley and Crick, 2014). For example, Nadji et al. (2013) produce
a network of known fraudulent infrastructure by creating links between IP addresses using known attack
signatures garnered from passive domain name server data and several other sources for malicious activities.
Krebs (2002) builds edges between known hijackers of the 9-11 terrorist attacks by manually gathering
data from online news articles. The edges, or links, are created based on information such as whether the
two persons went to the same school, grew up in the same locality, etc. Oatley and Crick (2014) follow a
similar track, using associations such as partner, sibling, cohabitant, to build a relationship network among
the members of different UK crime gangs. Clearly, the above methods are time-consuming, and a faster,
more automated process of building a relationship network would be very useful for investigators of criminal
activities.

We present such an algorithm, which can be run on a large dataset of interactions, to build a more
practicable sub-network of known criminals suitable for further investigation. We use the publicly available
Enron Dataset (Cohen, 2009), which contains all email communications before and after the collapse of this
large company in 2001. This dataset is appropriate for this exercise, as ten people connected with Enron
were subsequently convicted of money laundering (Securities and Release 2004).

The structure of the rest of the paper is as follows. In the next section, we describe the Enron dataset
in more detail, give the process by which we start the isolation of specific email groupings, compare the
connections between the ten criminals in two different email sub-networks, and describe our algorithm.
Section 3 gives the results of applying existing community detection algorithms, as well as the k-neighbourhood
method, to the Enron dataset to identify the communities to which the criminals belong. In the
section after this, we apply our shortest paths network search algorithm to the two email sub-networks
previously identified and compare the results to those obtained by applying the existing community detection
algorithms. The penultimate section details the application of our algorithm to the different scenarios that
an investigator may encounter. Finally, we give the conclusion.

2. Background


This section describes the preliminary analysis of the Enron email dataset, the people who were convicted 
of money laundering crime, the identification of criminal communication links and the criminal sub-network 
formation methods. 


2.1. Preliminary analysis of dataset 


The Enron email dataset contains 1,887,305 email transactions (Cohen, 2009) that were sent using the
fields ‘TO’, ‘CC’ or ‘BCC’. These transactions involve 16,116 sender addresses and 68,203 receiver
addresses. The Enron email dataset contains a mix of internal and external email transactions. Among
the 16,116 email senders, 5,831 are Enron company email
accounts having the name ‘enron’ in their email address, and the rest of the addresses are external, for example
andrew.fastow@ljminvestments.com, anitatr@earthlink.net, etc. In order to process this large number of
emails, we start by extracting the emails sent and received in the last 8 years of Enron, from 1995 to 2002
(Salter, 2008). We clean the data by removing irrelevant email transactions, such as those involving email
addresses that consist of numbers and characters (for example ‘5673@aol.com’), that end with an airline company
name (for example ‘@aircanada.com’), or that end with ‘xpedia.com’ or ‘amazon.com’, as well as other
auto-response emails.
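
The cleaning step described above can be sketched as a simple address filter. This is only an illustration under stated assumptions, not the procedure actually used: the function names are hypothetical, the suffix list contains only the examples given in the text, and "numbers and characters" is approximated as an address whose local part consists solely of digits.

```python
import re

# Suffixes taken from the examples in the text; a real list would be longer.
BLOCKED_SUFFIXES = ("@aircanada.com", "xpedia.com", "amazon.com")
# Assumption: addresses such as 5673@aol.com are approximated here as
# addresses whose local part consists only of digits.
NUMERIC_LOCAL_PART = re.compile(r"^\d+@")


def is_relevant(address):
    """Return True if the address survives the (hypothetical) cleaning rules."""
    addr = address.strip().lower()
    if NUMERIC_LOCAL_PART.match(addr):
        return False
    return not addr.endswith(BLOCKED_SUFFIXES)


def clean(transactions):
    """Keep (sender, receiver) pairs in which both addresses are relevant."""
    return [(s, r) for s, r in transactions if is_relevant(s) and is_relevant(r)]
```

A transaction is dropped as soon as either endpoint matches one of the rules, which keeps the resulting graph free of nodes that exist only because of automated mail.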

Several prior works propose ways of extracting criminal networks in the form of associations between
texts or people (Basu, 2014; Krebs, 2002). Chen et al. (2004) mine relevant terms from a large volume of police
incident summaries and assign the co-occurrence frequency as a weight to each term in order
to design a criminal network, while Yang and Ng (2007) use web crawlers to gather identities associated with
certain crime-related topics in web blog pages and represent them as a network. Similarly, in order to identify
criminal cliques, Iqbal et al. (2012) perform chat topic analysis, and entities that belong to the same
chat session are formed into a clique. Louis and Engelbrecht (2011) conduct text mining on passages of a
mystery novel to show the association between words, in the form of a graph, leading to the identification of
murders. In (Anwar and Abulaish, 2012), posts that promote hate and violence in certain dark web forums
are grouped in different cliques using an algorithm that measures similarity based on content, time, author
and title.

Using keywords as a tool for isolating criminal networks is problematic, especially when electronic
documents, chat messages, web blogs or emails contain incomplete information or could mislead detection
algorithms (Murynets and Piqueras Jover, 2012; Keila and Skillicorn, 2005). Consequently, we choose to
ignore the content of the emails exchanged between the criminals and propose a very different
way to start building a criminal network, by considering the type of emails based on recipient fields.
As detailed in (Magalingam et al., 2014), we separate the emails with at least one BCC recipient because
the existence of a bcc in an email could indicate a trust relationship (Fox and Schaefer, 2012). While ‘to’
and ‘cc’ recipients are visible to all recipients, as McDowell and Householder (2009) and Bogawar and
Bhoyar (2012) point out, there is something inherently secretive about adding a ‘bcc’ recipient to an email.
We explore the suspects’ secret connections in the group of emails that consists of all those with one or
more bcc-ed recipients and compare our result with the connections found using email transactions that
have recipients only in the ‘TO’ and ‘CC’ fields.
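
The separation by recipient field can be sketched as follows. The record layout (a dictionary with `sender`, `to`, `cc` and `bcc` keys) and the function names are hypothetical, and drawing an edge from the sender to every recipient of an email is an assumption about how the graphs are built rather than a documented detail.

```python
def split_by_bcc(emails):
    """Partition emails into (TO/CC only, at least one BCC recipient)."""
    to_cc_only, with_bcc = [], []
    for mail in emails:
        (with_bcc if mail.get("bcc") else to_cc_only).append(mail)
    return to_cc_only, with_bcc


def directed_edges(emails):
    """Assumed construction: one directed edge from sender to each recipient."""
    edges = set()
    for mail in emails:
        for field in ("to", "cc", "bcc"):
            for recipient in mail.get(field, []):
                edges.add((mail["sender"], recipient))
    return edges
```

Applying `directed_edges` to each of the two groups returned by `split_by_bcc` gives one edge set per group, from which the two directed networks can be formed.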

The emails are divided into two groups. The first group is made up of email transactions that have 
recipients in the ‘TO’ and ‘CC’ fields. These emails do not contain any BCC recipients. The network 
formed using the ‘TO’ and ‘CC’ email transactions has 26,027 nodes and 1,048,572 edges with an average 
degree of 80.58. Henceforth, we refer to this network as Netgraph, and each node ID in it is called a
Net. ID. The relationship between the nodes’ degrees and their frequencies is displayed in the log-log plot of the
degree distribution (Figure 1(a)).



[Figure 1, panels: (a) Degree distribution of ‘TO & CC’ email group; (b) Degree distribution of ‘BCC’ email group. Horizontal axis: node degree.]

Figure 1: The figures show that the degree distributions for both Netgraph (a) and BCC Netgraph (b) are
heavy-tailed, with many nodes being of low degree and some nodes being highly connected.

The second group of emails consists of all those with one or more bcc-ed recipients. The network formed
using the BCC email transactions is called the BCC Netgraph, with each node in this BCC Netgraph given
a BCCNet. ID. The BCC Netgraph contains 19,716 nodes and 238,761 edges. The BCC Netgraph has an
average degree of 24.22, and the log-log plot of its degree distribution is shown in Figure 1(b). As is evident
from Figure 1(a) and (b), both networks have many nodes with low degree and a few nodes with very high
degree, indicating that the degree distributions are heavy-tailed. Sub-networks are then constructed from each of
these large networks using our shortest paths network search algorithm (see Section 2.4). Next, the persons
convicted of money laundering in relation to the collapse of Enron are listed along with their Net. IDs and
BCCNet. IDs.
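
As a quick consistency check, the reported average degrees agree with the usual definition 2E/N, in which each edge contributes to the degree of both of its endpoints. That this is the definition used for the figures above is an assumption; the function name is hypothetical.

```python
def average_degree(num_nodes, num_edges):
    """Average total degree, counting both endpoints of every edge."""
    return 2 * num_edges / num_nodes


netgraph_avg = round(average_degree(26_027, 1_048_572), 2)  # 80.58
bcc_avg = round(average_degree(19_716, 238_761), 2)         # 24.22
```

Both values match the averages quoted for the Netgraph and the BCC Netgraph.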


2.2. Enron money laundering criminals 


Ten people were convicted of money laundering in relation to the collapse of Enron (Brickey, 2003).
Table 1 below shows the Net. ID of the criminals appearing within Netgraph as well as the BCCNet. ID of
those appearing within the BCC Netgraph.


Table 1: Enron money laundering criminals

Name              Net. ID    BCCNet. ID    Email Address
Andrew Fastow     1472       686           andrew.fastow@enron.com
Andrew Fastow     -          687           andrew.fastow@ljminvestments.com
Lea Fastow        17589      11010         lfastow@pop.pdq.net
Lea Fastow        17588      11009         lfastow@pdq.net
Kevin Hannon      16202      10068         kevin.hannon@enron.com
Kenneth Rice      16115      9994          kenneth.rice@enron.com
Rex Shelby        23983      15224         rex.shelby@enron.com
Rex Shelby        23985      15225         rex_shelby@enron.net
A. Khan           -          205           adnankkhan@hotmail.com
Michael Kopper    20217      12708         michael.kopper@enron.com
Ben Glisan        -          1369          ben.glisan@enron.com
Joe Hirko         14052      8716          joe.hirko@enron.com
S. Yaeger         -          861           anne.yaeger@enron.com

Table 1 gives the list of e-mail accounts associated with criminals involved in the Enron money laundering crime
(Brickey, 2003; Cohen, 2009). The IDs in the table are computer-generated numbers assigned to distinct email
addresses based on the type of network. The Net. ID refers to the email addresses of criminals in the Netgraph while
the BCCNet. ID refers to the email addresses of criminals in the BCC email network.


The multiple email addresses of the criminals, leading to multiple, different IDs, are preserved because some
of these accounts, for example Andrew Fastow (BCCNet. ID 687), A. Khan (BCCNet. ID 205), Ben Glisan
(BCCNet. ID 1369) and S. Yaeger (BCCNet. ID 861), did not occur in the Netgraph but were present in
the BCC Netgraph. Each of these email addresses occurs in distinct email transactions. We now identify the
length of the shortest paths between these criminals in both the Netgraph and the BCC Netgraph.


2.3. Distribution of criminal links in Netgraph and BCC Netgraph 

Here, an analysis is conducted to compare the connections formed in the Netgraph and BCC Netgraph.
Network measures are often used to quantify network structures: for example, the number of vertices in
a network measures the size of the network, a vertex’s degree could be used to show whether it is strong
or weak, the shortest path length measures the distance between vertices, centrality measures demonstrate
the level of importance of vertices, etc. (Newman, 2010). We use the length of the shortest paths and the
degree of the node. Both the Netgraph and the BCC Netgraph are directed graphs. In order to analyse a
criminal’s communication links within these graphs, we first calculate the length of the shortest paths from
one criminal to another. Tables 2 and 3 show the lengths of the shortest paths from criminal to criminal in the
directed Netgraph and the directed BCC Netgraph respectively. The directed Netgraph has an average path
length of 3.203258 while the directed BCC Netgraph’s average path length is 4.445533.
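
The criminal-to-criminal path lengths can be computed with a breadth-first search from each account. The sketch below assumes an unweighted directed graph stored as an adjacency dictionary; the function names are hypothetical, and unreachable targets are reported as infinity, matching the `Inf` entries in the tables.

```python
from collections import deque
import math


def path_lengths(adj, source):
    """Breadth-first search: hop counts from source along directed edges."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def length_matrix(adj, nodes):
    """Map each ordered pair (s, t) of nodes to its shortest path length."""
    matrix = {}
    for s in nodes:
        dist = path_lengths(adj, s)
        for t in nodes:
            matrix[(s, t)] = dist.get(t, math.inf)  # inf: t unreachable from s
    return matrix
```

Because the edges are directed, the matrix is generally not symmetric: a path from one account to another does not imply a path back.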


Table 2: Shortest path length from criminal to criminal in the directed Netgraph

Directed Netgraph

          1472  17589  17588  16202  16115  23983  23985  20217  14052
 1472        0      3      3      4      3      2      3      3      3
17589      Inf      0    Inf    Inf    Inf    Inf    Inf    Inf    Inf
17588      Inf    Inf      0    Inf    Inf    Inf    Inf    Inf    Inf
16202      Inf    Inf    Inf      0    Inf    Inf    Inf    Inf    Inf
16115      Inf    Inf    Inf    Inf      0    Inf    Inf    Inf    Inf
23983        2      3      3      4      2      0      2      2      3
23985      Inf    Inf    Inf    Inf    Inf    Inf      0    Inf    Inf
20217      Inf    Inf    Inf    Inf    Inf    Inf    Inf      0    Inf
14052      Inf    Inf    Inf    Inf    Inf    Inf      0    Inf      0

Table 2 shows the lengths of shortest paths from criminal to criminal in the directed Netgraph.

Table 3: Shortest path length from criminal to criminal in the directed BCC Netgraph


Directed BCC Netgraph 


686 687 11010 11009 10068 12708 1369 9994 8716 15224 15225 861 205 


686 0 


11010 3 


Inf Inf 


11009 Inf Inf Inf 


Inf Inf Inf Inf 


Inf Inf 


10068 2 


12708 Inf Inf Inf 


Inf Inf Inf Inf 


Inf Inf 


9994 Inf Inf Inf 


Inf Inf 


8716 Inf Inf Inf 


Inf Inf 


Inf Inf 


15224 3 


15225 Inf Inf Inf 


Inf Inf Inf Inf 


Inf Inf Inf 


Inf Inf Inf Inf 


Inf Inf Inf 


Inf Inf Inf Inf 


Table 3 shows the lengths of shortest paths from criminal to criminal in the directed BCC Netgraph.


From Tables 2 and 3 it is clear that, in the directed graphs under consideration, only a few criminals have
directed paths connecting them to other criminals. If we make the broad assumption that an email sent from
A to B implies an undirected relationship between A and B, then the graphs become undirected. In this case,
in the BCC Netgraph, 12 of the 13 accounts associated with criminals belong to the same connected component
and a path can be found from one criminal’s account to another’s (see Table 5). The exception is the
account associated with A. Khan (adnankkhan@hotmail.com), which belongs to a separate component. The
assumption regarding reciprocal relationship seems most appropriate for the trust network (BCC Netgraph)
where, if A includes B as a BCC recipient, there is a personal trust relationship implied between A and B
that we assume is reciprocated to some degree.
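
The reciprocity assumption amounts to symmetrising the edge list, after which connected components can be found with a simple traversal. The following is a minimal sketch with hypothetical function names, not the tooling used for the reported results.

```python
def to_undirected(edges):
    """Treat every directed edge (u, v) as a two-way relationship."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def components(adj):
    """Return the connected components of an undirected graph as node sets."""
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        result.append(comp)
    return result
```

Two accounts can be joined by an undirected path exactly when they fall in the same component, which is how an isolated account such as the one noted above would show up.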

The shortest path lengths between the criminals in the undirected graphs are shown in Tables 4 and 5.
The average path lengths of the undirected Netgraph and the undirected BCC Netgraph are 3.264676
and 5.033507 respectively. The average path lengths between the criminals in the undirected Netgraph and the
undirected BCC Netgraph are 2.93 and 3.65 respectively, lower than the average path lengths of the entire
graphs.

Table 4: Shortest path length from criminal to criminal in the undirected Netgraph


Undirected Netgraph 


1472 17589 17588 16202 16115 23983 23985 20217 14052 


1472 0 


17589 3 


17588 2 


16202 3 


16115 2 


23983 2 


23985 3 


20217 2 


14052 2 


Table 4 shows the lengths of shortest paths from criminal to criminal in the undirected Netgraph.


Table 5: Shortest path length from criminal to criminal in the undirected BCC Netgraph 

Undirected BCC Netgraph 

686 687 11010 11009 10068 12708 1369 9994 8716 15224 15225 861 205 


11010 3 


11009 3 


10068 2 


12708 2 


15224 3 


15225 5 


Inf Inf Inf 


Inf Inf Inf Inf 


Table 5 shows the lengths of shortest paths from criminal to criminal in the undirected BCC Netgraph.


The distance between any two criminals in the undirected Netgraph ranges from 2 to 4 (see Table 4), while
in the undirected BCC Netgraph it ranges from 1 to 7 (see Table 5). Judging by the shortest path lengths,
the connections formed between criminals in the undirected BCC Netgraph differ from those in
the undirected Netgraph. In the undirected BCC Netgraph, there are some direct links between certain
criminals, for example from Andrew Fastow (BCCNet. ID 686) to Lea Fastow (11010), from Andrew Fastow (686) to
Ben Glisan (1369) and from Michael Kopper (12708) to Ben Glisan (1369). The emphasis of this paper is on
the BCC Netgraph. In the next subsection, we briefly describe our shortest paths network search algorithm.


2.4. Criminal network formation methods

Past research shows that a criminal community can be formed using certain pre-defined rules. For example,
in (Al-Zaidy et al., 2012), a set of people belong to the same community if their names appear together in
a document, while Anwar and Abulaish (2014) group people into a community based on overlapping interests
across different chat sessions. Our shortest paths network search algorithm is used to form a relationship
network between suspects as a basis for an investigation (Magalingam et al., 2014). Unlike Al-Zaidy et al.
(2012) and Anwar and Abulaish (2014), our algorithm does not restrict community membership to those
with similarity in email content. By retaining duplicate email addresses (unlike Al-Zaidy et al. (2012)), we
show that these could indicate secret trusted connections. Our network links do not depend on overlapping
interests as in Anwar and Abulaish (2014) but on a node’s associations with particular central
nodes and its links to known suspects. In our algorithm, the node or edge with highest centrality value
is used. Girvan and Newman (Girvan and Newman, 2002; Newman and Girvan, 2004) remove the edge
with highest betweenness score until they find different sub-networks or communities. In more recent work,
Ferrara et al. (2014) use a log analysis tool that adapts several community detection algorithms, as well
as utilising modularity optimisation as done in the Girvan and Newman algorithm (Girvan and Newman, 2002)
and Newman’s fast algorithm (Newman, 2004), to detect criminal organisations using phone call networks.
Unlike the log analysis tool of Ferrara et al. (2014), which implements Girvan and Newman’s algorithm, we
retain the central nodes, using them to form sub-networks of interest.


The shortest paths network search algorithm (SPNSA) is described in detail in (Magalingam et al.,
2014). Firstly, the algorithm requires a ‘feed’, a number of nodes of interest, whether these are known or
suspected criminals or are otherwise regarded as relevant or important. For the Enron email network, each email
account is used as a feed. For example, if a criminal has two email accounts, both accounts are used in
the feed list to represent that one criminal (see Table 1). The algorithm then works by isolating a particular
ego network and, within that ego network, identifying two central nodes, one with the highest betweenness
centrality and the other with the highest eigenvector centrality. These nodes are named the Middle Man (MM)
and the Most Influential (MI) respectively. Within the same ego network, the algorithm proceeds to extract
the shortest paths from the ego to the central nodes, as well as from the ego to other nodes of interest in the
list and finally, from the other nodes of interest to the central nodes. The three steps are repeated for every
node of interest by selecting each in turn as the ego. The results of the various extractions are combined to
form a new sub-network. In Magalingam et al. (2014), we studied the use of SPNSA on the directed BCC email
network containing emails with 1 and 2 recipients bcc-ed. We found that the sub-network formed using
SPNSA suggested possible people to investigate between known criminals and financial managers. This new
sub-network could be used by an investigator as a preliminary investigative network (Magalingam et al.,
2014). In this paper, we apply the SPNSA to analyse larger subsets of the Enron dataset than those studied
in (Magalingam et al., 2014), in addition to comparing its performance against known community detection
algorithms. Before applying the SPNSA, we use the different community detection algorithms to discover
the various criminals’ communities in the Netgraph and BCC Netgraph.
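
The SPNSA steps described in this subsection can be sketched as follows. This is a simplified reading of the prose, not the published implementation: the ego network is assumed to be the set of nodes within a fixed `radius` of the ego (the radius is an assumption), the graph is assumed undirected and unweighted, only one shortest path per pair is kept, and all function names are hypothetical.

```python
from collections import deque


def ego_network(adj, ego, radius):
    """Induced subgraph on all nodes within `radius` hops of `ego`."""
    dist = {ego: 0}
    queue = deque([ego])
    while queue:
        u = queue.popleft()
        if dist[u] < radius:
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
    return {u: [v for v in adj[u] if v in dist] for u in dist}


def shortest_paths_from(adj, src):
    """One shortest path from `src` to every reachable node (BFS tree)."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    paths = {}
    for node in parent:
        path, cur = [], node
        while cur is not None:
            path.append(cur)
            cur = parent[cur]
        paths[node] = path[::-1]
    return paths


def betweenness(adj):
    """Unnormalised betweenness centrality (Brandes' algorithm, unweighted)."""
    score = dict.fromkeys(adj, 0.0)
    for s in adj:
        order, preds = [], {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


def eigenvector(adj, iterations=100):
    """Eigenvector centrality by power iteration on A + I (undirected graph)."""
    x = dict.fromkeys(adj, 1.0)
    for _ in range(iterations):
        nxt = {v: x[v] + sum(x[u] for u in adj[v]) for v in adj}
        top = max(nxt.values())
        x = {v: val / top for v, val in nxt.items()}
    return x


def spnsa_sketch(adj, feed, radius=2):
    """Union of the shortest-path edges extracted around each feed node."""
    edges = set()
    for ego in feed:
        sub = ego_network(adj, ego, radius)
        bc, ec = betweenness(sub), eigenvector(sub)
        central = {max(sub, key=bc.get), max(sub, key=ec.get)}  # MM and MI
        others = {f for f in feed if f != ego and f in sub}
        targets = {ego: central | others}             # ego to central, ego to feed
        targets.update({o: central for o in others})  # other feed nodes to central
        for src, goals in targets.items():
            paths = shortest_paths_from(sub, src)
            for goal in goals:
                path = paths.get(goal, [])
                edges.update(frozenset(e) for e in zip(path, path[1:]))
    return edges
```

On a toy graph in which two feed nodes share a single well-connected neighbour, the sketch returns only the two edges through that neighbour and discards the rest of the graph, which illustrates why the resulting sub-networks stay small and sparse.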


3. Discovering criminals’ community using community detection algorithms 

Many authors have used algorithms to analyse community structure and consequently to identify groups
or sub-networks. An example of this is the use of network modularity in community detection algorithms
(Girvan and Newman, 2002; Newman and Girvan, 2004). Communities exist when a graph consists of sets
of nodes in tightly knit groups joined together by weaker connections between these groups (Girvan and
Newman, 2002; Newman, 2010). The link structure and node attributes are the common components used
in community detection algorithms. Girvan and Newman (Girvan and Newman, 2002; Newman and Girvan,
2004) repeatedly calculate edge betweenness, each time removing the edge with the highest betweenness score,
such that as the graph becomes disconnected, the components represent the communities. Other link-based
community detection approaches can be found in (Radicchi et al., 2004).


A different way of identifying communities is the node-based approach of agglomerative
algorithms (Blondel et al., 2008). Pons and Latapy (Pons, 2006) introduce a random walk concept which
picks nodes from a network based on a fixed distance between two nodes. EAGLE is a software algorithm
created by Shen et al. (2009) that follows certain steps to form a community. First, it adapts
the maximal clique calculation introduced by Bron and Kerbosch (1973), then removes
the subordinate maximal clique. The algorithm then calculates the similarity between each pair of nodes
in the clique, merges them into a new community, finds the similarity of the new community by comparing
with an already existing community, repeating the steps until only one community remains.


Fastgreedy community detection (Clauset et al., 2004) uses a modularity optimization algorithm by first
computing the fraction of within-community edges in a network, then subtracting from it the expected
fraction of such edges in a randomized version of the same network with the same degree distribution. A nonzero
value above 0.3 is considered a good measurement for the density of links inside communities (Clauset et al.,
2004; Blondel et al., 2008). Walktrap community detection merges similar nodes that are obtained using
short random walks into a group (Pons, 2006). The leading eigenvector community detection algorithm
implements the modularity optimization algorithm. It computes the modularity matrix and the eigenvector
of the matrix. It then divides the community based on the positive or negative sign of the elements in
the eigenvector. If the large elements have the same sign, then the network has no community structure
(Newman, 2006).
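
The modularity quantity described above (observed fraction of within-community edges minus the fraction expected under a degree-preserving randomisation) can be computed for a given partition as in the following sketch. It assumes an undirected, unweighted edge list without self-loops; the function and variable names are hypothetical, and this evaluates a partition rather than performing the greedy optimisation itself.

```python
def modularity(edges, membership):
    """Newman-Girvan modularity Q of a partition of an undirected graph."""
    m = len(edges)
    inside = {}  # community -> number of edges with both ends inside it
    degree = {}  # community -> sum of the degrees of its nodes
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    # Q = sum over communities of (edge fraction) - (expected edge fraction).
    return sum(inside.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree.items())
```

For two triangles joined by a single edge and split into their natural two communities, this gives Q = 5/14, roughly 0.357, just above the 0.3 level mentioned in the text; placing all nodes in one community gives Q = 0.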


The results produced by applying these different methods to the Enron dataset to identify criminals’
communities are discussed next. The community detection methods used are the k-Neighbourhood, Fastgreedy,
Walktrap and Leading Eigenvector algorithms, all of which are available in the R igraph tool. Because the
connections between criminals are more visible in the undirected graphs than in the directed graphs
(see Tables 2, 3, 4 and 5), and since the majority of the community algorithms can only be applied to
undirected graphs, we use the undirected Netgraph and BCC Netgraph for this exercise. We will later
compare (in Section 4.2) these results with those obtained by using our shortest paths network search
algorithm.


3.1. k-Neighbourhood detection 

We first compute the total degree of each criminal in both the undirected Netgraph and BCC Netgraph.
Table 6 shows the values. The neighbourhood function has been used previously to form a subgraph and
identify the nearest link from a criminal node (Savage et al., 2014; Yasin et al., 2014). Using Table 6, we find
all the neighbours of Andrew Fastow (686), the first criminal in our list, in the undirected BCC Netgraph
at a distance of 1 to 4 and form the networks.

Table 6: Enron money laundering criminals’ degree

Name              Net. ID    Degree    BCCNet. ID    Degree
Andrew Fastow     1472       261       686           25
Andrew Fastow     -          -         687           1
Lea Fastow        17589      3         11010         4
Lea Fastow        17588      4         11009         4
Kevin Hannon      16202      1         10068         36
Kenneth Rice      16115      11        9994          4
Rex Shelby        23983      97        15224         21
Rex Shelby        23985      2         15225         2
A. Khan           -          -         205           2
Michael Kopper    20217      40        12708         6
Ben Glisan        -          -         1369          105
Joe Hirko         14052      11        8716          2
S. Yaeger         -          -         861           1

Table 6 shows the total degree of each criminal in Netgraph and BCC Netgraph.

Figure 2 shows the networks found using the 1-, 2-, 3- and 4-neighbourhoods of Andrew Fastow (686) in the
undirected BCC Netgraph respectively.

[Figure 2, panels: (a) 1-N network (26 nodes, 149 edges); (b) 2-N network (1,559 nodes, 15,853 edges); (c) 3-N network (8,433 nodes, 49,086 edges); (d) 4-N network (14,916 nodes, 60,593 edges).]

Figure 2: The figures above show the networks formed by using 1,2,3,4-neighbourhood (1,2,3,4-N) of Andrew
Fastow (686) in the undirected BCC Netgraph respectively. In the 1-neighbourhood of Andrew Fastow only one
other criminal was found, Ben Glisan (1369). The 2-, 3- and 4-neighbourhood networks are clearly too dense to be
able to identify other criminals easily.


In the 1-neighbourhood network of Andrew Fastow only one criminal was found, Ben Glisan (1369).
According to Tables 3 and 5, using either of Andrew Fastow’s email accounts (BCCNet. ID 686 or 687)
and either the undirected or the directed BCC Netgraph, the one- or two-neighbourhoods would only rarely
contain the other known criminals. The size of the network grows as the neighbourhood increases.
The same method, when applied to the undirected Netgraph, also produces large graphs that are difficult
to explore. Clearly, using these dense network graphs, it is difficult to analyse a criminal’s occurrence and
connections with other nodes.
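
The k-neighbourhood extraction can be sketched as a depth-limited breadth-first search followed by taking the induced subgraph. The sketch assumes an undirected graph stored as a dictionary of neighbour sets; the function names are hypothetical and it is not the R igraph routine used for the reported figures.

```python
from collections import deque


def k_neighbourhood(adj, source, k):
    """Return the set of nodes within k hops of source (source included)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # do not expand beyond the k-th hop
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)


def induced_size(adj, nodes):
    """Count nodes and undirected edges of the subgraph induced by nodes."""
    twice_edges = sum(1 for u in nodes for v in adj[u] if v in nodes)
    return len(nodes), twice_edges // 2
```

Because every additional hop pulls in all neighbours of the current frontier, the node set can only grow with k, which is the growth pattern visible in the network sizes reported for Figure 2.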


3.2. Community detection algorithms in R igraph 

In the next two sub-sections, 3.2.1 and 3.2.2, we compare the number of criminals, communities and
connections of criminals to other nodes found using the community detection algorithms: Fastgreedy (Clauset
et al., 2004), Walktrap (Pons, 2006) and Leading Eigenvector (Newman, 2006). As all three of these
algorithms require undirected graphs, we use the undirected Netgraph and BCC Netgraph for this experiment.


3.2.1. Results of undirected Netgraph 

Here we look at the results of applying the community detection algorithms to the undirected Netgraph. 
Applying the Fastgreedy community detection algorithm gives 15 communities. These communities range 
in size from 5,803 nodes to just 2 nodes. 6 out of 10 criminals were found in the second largest community 
that had 5,163 nodes and 22,936 edges. One criminal, Kevin Hannon (16202) appeared in a much smaller 
group consisting of 230 nodes and 781 edges. 

Next, the Walktrap community detection algorithm was applied to the undirected Netgraph with the
length of the random walk being 10 steps. The algorithm found 530 communities; the first
community detected consisted of 3,126 nodes and 12,462 edges, and again contained 6 of the 10 criminals
(see Table 7). It was the second largest community formed by Walktrap. The largest community had 7,935
nodes while the smallest one had just 1 node.

The third community detection algorithm used was the Leading Eigenvector. This algorithm
detected 2 communities, with the largest one containing 26,025 nodes and the smallest 2 nodes. All 7
criminals appeared in the largest community but the criminals were found to be isolated (see Table 7).

3.2.2. Results of undirected BCC Netgraph 

The community detection algorithms were next applied to the undirected BCC network graph. The
BCC Netgraph contains 65,532 edges and 19,716 nodes. The Fastgreedy algorithm found 832 communities,
including a number of small communities with fewer than 6 nodes each. 5 criminals were found to be in the
largest community, which had 2,195 nodes and 7,903 edges: Andrew Fastow (BCCNet. ID 686), Lea Fastow
(BCCNet. ID 11010, BCCNet. ID 11009), Kevin Hannon (BCCNet. ID 10068), Ben Glisan (BCCNet. ID
1369) and Kenneth Rice (BCCNet. ID 9994). Ben Glisan (BCCNet. ID 1369) had the highest degree in
this community. Meanwhile, Rex Shelby (15224, 15225) and S. Yaeger (861) belonged to the second largest
community with 2,142 nodes and 8,222 edges. The criminal who was in one of the smaller communities was
Joe Hirko (8716).

Using the Walktrap community detection algorithm produced 1,773 communities from the undirected
BCC Netgraph. The largest community had 1,493 nodes and the smallest one had just 1 node. Seven out
of 10 criminals were found in the same community, which had 1,254 nodes (see Table 7). The Leading
Eigenvector community detection algorithm found 719 communities in the BCC Netgraph. These ranged from
the largest community with 15,792 nodes to the smallest with 1 node. Most of the criminals appeared
in the largest community. In Table 7 we list the communities to which the criminals belong and the
community sizes, which we identified manually.

Table 7: Criminals Found in Different Communities 

Net. ID          FG Com. ID  WT Com. ID  LEC Com. ID  BCCNet. ID  FG Com. ID  WT Com. ID   LEC Com. ID
1472             {7, 5163}   {1, 3126}   {1, 26025}   686         {5, 2195}   {36, 1254}   {1, 2001}
-                -           -           -            687         {5, 2195}   {594, 2}     {719, 15792}
17589            {7, 5163}   {1, 3126}   {1, 26025}   11010       {5, 2195}   {594, 2}     {719, 15792}
17588            {7, 5163}   {1, 3126}   {1, 26025}   11009       {5, 2195}   {36, 1254}   {719, 15792}
16202            {10, 230}   {29, 228}   {1, 26025}   10068       {5, 2195}   {36, 1254}   {719, 15792}
16115            {7, 5163}   {1, 3126}   {1, 26025}   9994        {5, 2195}   {36, 1254}   {719, 15792}
23983            {7, 5163}   {1, 3126}   {1, 26025}   15224       {3, 2142}   {36, 1254}   {719, 15792}
23985            {7, 5163}   {4, 7935}   {1, 26025}   15225       {3, 2142}   {1365, 1}    {719, 15792}
-                -           -           -            205         {224, 3}    {1030, 3}    {719, 15792}
20217            {7, 5163}   {1, 3126}   {1, 26025}   12708       {2, 2113}   {36, 1254}   {1, 2001}
-                -           -           -            1369        {5, 2195}   {45, 1493}   {719, 15792}
14052            {7, 5163}   {1, 3126}   {1, 26025}   8716        {17, 561}   {36, 1254}   {719, 15792}
-                -           -           -            861         {3, 2142}   {54, 1101}   {719, 15792}
Total community  15          530         2            -           832         1773         719

Table 7 shows the community to which each criminal belongs. Each community and its size are 
represented in curly brackets as {ith community, size}. The column titles are Net. ID (Netgraph ID), FG Com. 
ID (Fastgreedy Community ID), WT Com. ID (Walktrap Community ID) and LEC Com. ID (Leading Eigenvector 
Community ID). The total number of communities formed is shown at the bottom of the table. The communities 
with the smallest number of nodes (1-3 nodes) are highlighted in red. 
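
The manual lookup behind Table 7 can be automated. The following is a minimal sketch, assuming the output of a community detection run has been exported as a mapping from node ID to community ID; the function name and the toy membership values are hypothetical and are not taken from the Enron data.

```python
from collections import Counter


def community_of_interest(membership, node_ids):
    """Map each node of interest to (community ID, community size).

    membership: dict of node ID -> community ID (assumed export format).
    Nodes that are absent from the graph are skipped.
    """
    sizes = Counter(membership.values())
    return {
        node: (membership[node], sizes[membership[node]])
        for node in node_ids
        if node in membership
    }


# Toy membership vector (illustrative values only).
membership = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "c"}
lookup = community_of_interest(membership, [1, 4, 6, 99])
print(lookup)
```

Each entry of the result corresponds to one {ith community, size} cell of the table; node 99 is skipped because it does not exist in the toy graph, mirroring the "-" entries.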

3.2.3. Discussion of results obtained by R igraph community detection algorithms 

Network partitioning using Leading Eigenvector does not seem to be very effective for either the 
undirected Netgraph or the BCC Netgraph, as the biggest community contains almost all the 
nodes. Some abnormal network structures were found in the communities of certain criminals (see Table 
7). In the results of the Walktrap algorithm (highlighted in red), Andrew Fastow (687) and Lea Fastow 
(11010) belong to a small group of just two nodes, forming a community on their own. A. Khan (205) also 
belongs to a community of just three nodes. A. Khan is linked to two other nodes, with the email addresses 
toriarules@aol.com and mmorales@arnel.com. These are external email addresses that do not belong to the 
Enron company email group. The Walktrap algorithm also isolates Rex Shelby (15225) in a community of 
his own, numbered 1365. All of these abnormalities occurred in the undirected BCC Netgraph. We also 
found more communities in the undirected BCC Netgraph than in the undirected Netgraph. The total 
number of communities formed using each detection algorithm is shown in the last row of Table 7. 

The communities that contain the most criminals, community 5 from the Fastgreedy algorithm 
and community 36 from the Walktrap algorithm on the undirected BCC Netgraph, were extracted. 
Detection using the Walktrap algorithm on the BCC Netgraph yields the best result: some criminals appear 
in small communities of their own and 7 of the 1,254 members of community 36 are criminals. However, 
the result still contains an enormous number of nodes and links (as shown in Figure 3) that an investigator 
would need to analyse in order to find any connections between criminals and other nodes. 



(a) Community numbered 5 using Fast Greedy algorithm (b) Community numbered 36 using WalkTrap algorithm 

Figure 3: Community 5 (2,195 nodes) found using the Fastgreedy algorithm and community 36 (1,254 nodes) 
found using the Walktrap algorithm on the undirected BCC Netgraph. 


Another method, clique percolation community detection, developed by Palla et al. (2005), was also 
used to identify the Enron criminal community. We found that the clustering coefficient values for both the 
undirected Netgraph and BCC Netgraph were so low that the nodes did not converge to form communities. 
A high average clustering coefficient relative to a corresponding random network is needed to form 
sub-networks or clusters (Palla et al. 2005; Hills et al. 2009). In the next section, we apply our shortest 
paths network search algorithm to all four networks: the directed and undirected Netgraph and BCC Netgraph. 
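
To make the comparison with a random network concrete, the sketch below computes the average local clustering coefficient of an undirected graph and the value expected for an Erdős-Rényi random graph with the same number of nodes and edges, which equals the edge probability p = 2m / (n(n-1)). This is an illustration only: the adjacency structure and function names are assumptions, and the paper does not state which random-network baseline was used.

```python
from itertools import combinations


def average_clustering(adj):
    """Mean local clustering coefficient; nodes with degree < 2 contribute 0."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count edges that exist between pairs of neighbours.
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def random_baseline(adj):
    """Expected clustering of an Erdos-Renyi graph with the same n and m."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    return 2.0 * m / (n * (n - 1))


# Toy undirected graph: a triangle (1, 2, 3) with a pendant node 4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
observed = average_clustering(adj)
expected = random_baseline(adj)
print(observed, expected)
```

A ratio of observed to expected clustering well above 1 would indicate the kind of local cohesion that clique-based methods rely on.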


4. Application of Shortest Path Network Search Algorithm 


In Magalingam et al. (2014), the shortest paths network search algorithm (SPNSA) was used to identify 
a trust network from networks of emails that have 1 and 2 bcc-ed recipients, respectively, to start an 
investigation. Here, we apply our SPNSA to the directed and undirected Netgraph and BCC Netgraph. 
Unlike in Magalingam et al. (2014), the directed and undirected BCC Netgraph used here contain all 
bcc-ed recipients. In this section, all the criminals are used as the feed for SPNSA: 7 criminals in the 
Netgraph and 10 criminals in the BCC Netgraph (see the table of criminals’ node IDs). 
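
The general idea of building a sub-network from a feed can be sketched as follows. This is a simplified illustration, not the authors' SPNSA: it takes the union of one breadth-first shortest path for every ordered pair of feed nodes and ignores the central nodes (MM, MI) that the full algorithm also uses. The graph, node labels and function names are hypothetical.

```python
from collections import deque
from itertools import permutations


def shortest_path(adj, source, target):
    """Return one shortest path from source to target by BFS, or None."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj.get(node, ()):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


def shortest_paths_subnetwork(adj, feed):
    """Union of one shortest path per ordered pair of feed nodes."""
    nodes, edges = set(), set()
    for s, t in permutations(feed, 2):
        path = shortest_path(adj, s, t)
        if path:
            nodes.update(path)
            edges.update(zip(path, path[1:]))
    return nodes, edges


# Toy directed graph: C is in the feed but no directed path links it to A or B.
adj = {"A": ["x"], "x": ["B"], "B": ["A"], "y": ["C"], "C": []}
nodes, edges = shortest_paths_subnetwork(adj, ["A", "B", "C"])
print(sorted(nodes), sorted(edges))
```

In this toy case the feed node C is absent from the result because no directed path connects it to the other feed nodes, which is the kind of loss that can occur on a directed graph.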


4.1. Application of SPNSA on directed Netgraph and BCC Netgraph 

Applying the SPNSA to the directed Netgraph results in all 7 criminals occurring in the extracted 
shortest paths network, except for one email ID of Lea Fastow (Net. ID 17589) (see Figure 4). Net. ID 
17589, which represents Lea Fastow’s second email address, did not appear in the sub-network because 
the node does not have an out-component that builds paths to other criminals or the central nodes. 







Figure 4: The shortest paths network formed using the directed Netgraph. All 7 criminals were found. The 
network formed is sparse and criminals’ connections can be easily identified. This sub-network contains 30 
nodes. The nodes highlighted in red represent the criminals, identified by their Net. IDs. 

The sub-network of the directed BCC Netgraph captured using SPNSA is shown in Figure 5. This 
network contains 8 out of the 10 criminals. The two other criminals did not have any connection to other 
criminals or to the MM or MI, and thus did not occur in this sub-network. 



Figure 5: The shortest paths network formed using the directed BCC Netgraph. 2 out of 10 criminals were lost. 
The network formed is sparse and criminals’ connections can be identified. The number of nodes in this 
sub-network is 55. The nodes highlighted in red represent the criminals, identified by their BCCNet. IDs. 

4.2. Application of SPNSA on the undirected Netgraph and the undirected BCC Netgraph 

The SPNSA was then applied to the undirected Netgraph and the result is depicted in Figure 6. Again, all 
seven criminals occurred in this graph. This time, both Net. IDs of each criminal with two email addresses 
are captured in the sub-network. 

Figure 6: The shortest paths network formed using the undirected Netgraph. No criminal IDs were lost. The 
network formed is sparse and criminals’ connections can be identified. The size of this sub-network is 30 
nodes. The nodes highlighted in red represent the criminals, identified by their Net. IDs. 

We also applied the SPNSA to the undirected BCC Netgraph and the result is shown in Figure 7. 
In this case, SPNSA is able to identify two connected components that together contain all 10 
criminals. Compared with the result obtained from the undirected Netgraph, applying SPNSA to 
the undirected BCC Netgraph gives better results, as it shows all the criminals’ connections and 
has more connected components (see Figure 7). 

Figure 7: The shortest paths network formed using the undirected BCC Netgraph. No criminals were lost. The 
network formed is sparse and criminals’ connections can be identified. This sub-network consists of a 
component of 74 nodes and another small component of 3 nodes. The nodes highlighted in red represent the 
criminals, identified by their BCCNet. IDs. 
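
Separate components such as the two reported for Figure 7 can be listed with a standard breadth-first traversal. The sketch below is illustrative only; the adjacency dictionary is a hypothetical toy graph, not the extracted Enron sub-network.

```python
from collections import deque


def connected_components(adj):
    """Return the components of an undirected graph, largest first."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        component, queue = [], deque([start])
        while queue:
            node = queue.popleft()
            component.append(node)
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Toy undirected graph with a 4-node component and a 3-node component.
adj = {
    1: [2], 2: [1, 3], 3: [2, 7], 7: [3],
    4: [5], 5: [4, 6], 6: [5],
}
sizes = [len(c) for c in connected_components(adj)]
print(sizes)
```

Listing component sizes in this way makes a small isolated group stand out immediately next to the main component.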

The criminals and their links can be clearly seen in the results obtained (see the figures above). The 
undirected BCC Netgraph yields the largest number of criminals in the shortest paths network and the 
most connected components. A comparison of the results of the R igraph community detection algorithms 
with the result of the shortest paths network search algorithm (SPNSA) applied to the undirected BCC 
Netgraph (see Figure 7) is documented in Table 8. 




Table 8: Comparison between results found using R igraph community detection algorithms and SPNSA 

Community distribution 
  Community detection, undirected Netgraph: Nature: large and difficult to explore. Community size: 
    250 < nodes < 26,100. 
  Community detection, undirected BCC Netgraph: Nature: large and difficult to explore. Community size: 
    1,000 < nodes < 16,000. 
  SPNSA, undirected Netgraph: Nature: sparse to investigate and explore. Community size: 30 nodes. 
  SPNSA, undirected BCC Netgraph: Nature: sparse to investigate and explore. Two components occurred, 
    one of size 74 nodes and the other of 3 nodes. 

Abnormalities (Andrew Fastow and Lea Fastow) 
  Community detection, undirected Netgraph: Andrew Fastow’s external email add. (687) doesn’t exist. 
  Community detection, undirected BCC Netgraph: Walktrap community detection: found Andrew Fastow’s 
    external email add. (687) and Lea Fastow (11010) in the same small community of size 2 nodes. 
  SPNSA, undirected Netgraph: Andrew Fastow’s external email add. (687) doesn’t exist. 
  SPNSA, undirected BCC Netgraph: Found a direct connection between Andrew Fastow (687) and Lea Fastow 
    (11010) that emerged in a community of size 74 nodes. 

Abnormalities (A. Khan) 
  Community detection, undirected Netgraph: A. Khan (205) doesn’t exist. 
  Community detection, undirected BCC Netgraph: Fastgreedy and Walktrap community detection: found 
    A. Khan (205) belongs to a small community of size 3. 
  SPNSA, undirected Netgraph: A. Khan (205) doesn’t exist. 
  SPNSA, undirected BCC Netgraph: Found A. Khan (205) belongs to a small isolated community of size 3. 

Total criminals 
  Community detection, undirected Netgraph: Fastgreedy: 6/10 criminals in {7, 5163}. Walktrap: 6/10 
    criminals in {1, 3126}. 
  Community detection, undirected BCC Netgraph: Fastgreedy: 5/10 criminals in {5, 2195}. Walktrap: 7/10 
    criminals in {36, 1254}. 
  SPNSA, undirected Netgraph: 8/10 criminals. 
  SPNSA, undirected BCC Netgraph: All 10 criminals. 


Table 8 shows the comparison between results found using R igraph community detection algorithms and SPNSA. 
The community formed by SPNSA is small and suitable for investigation. The 3-clique connection is formed as a 
separate network component and the nodes that are connected to the criminal can be easily identified. In the 
row giving total criminals, {7, 5163} refers to {ith community, size}, and the same follows for the others. 


5. Crime investigation methods using SPNSA 


Anwar and Abulaish (2012) analysed their algorithm’s performance by implementing three different 
scenarios based on the availability of information. Similar to Anwar and Abulaish (2012), here we specify 
certain ways an investigator could extract criminal subgraphs using the shortest paths network search 
algorithm for a preliminary investigation. In this section, the undirected BCC Netgraph is chosen instead of 
the undirected Netgraph for three reasons found through our earlier experiments and the comparison 
between results in Table 8: more connections were detected between the criminals in the undirected BCC 
Netgraph; cliques of two or three criminals were found in the undirected BCC Netgraph; and the largest 
number of criminals was found in the undirected BCC Netgraph (see Table 8). 


5.1. Extracting sub-networks using Non-Criminals 

At an initial stage of a criminal investigation, an investigator may or may not have all or any of the 
criminals’ details. The investigator could start the investigation with a suitable group of people, feeding 
as many of the Enron managers’ or suspects’ node IDs as necessary into the algorithm to form a network 
for the investigation. We simulate such a scenario by feeding all the top managers at Enron into the 
algorithm to obtain their communication network from the undirected BCC Netgraph. The result is 
shown in Figure 8. 

Figure 8: Enron undirected BCC Netgraph shortest paths network using all top managers as algorithm feed. 
The nodes highlighted in red are all criminals. 


Compared with the network obtained when all criminals were used as the feed (Figure 7), 7 out of 10 
criminals are found by the algorithm this time. Apart from the managers known to be criminals (see 
Section 2.2), the SPNSA also extracts 4 other criminals with this top manager feed: Lea Fastow (11010), 
Kevin Hannon (10068), Rex Shelby (15224) and Joe Hirko (8716). 

Next, a shortest paths network is formed using only financial managers. The financial managers’ group 
is a subset of the top managers. The financial managers are the Head of Enron Global Finance, Sherron 
Watkins (16929); the Enron Chief Financial Officer, Andrew Fastow (686, 687); the Enron Corporation 
Treasurer, Ben Glisan (1369); the Chief Accounting Officer, Rick Causey (15077); and the Chief Financial 
Officer of Enron after Andrew Fastow, Jeff McMahon (8071). The network formed is shown in Figure 9. 

Figure 9: Enron undirected BCC Netgraph shortest paths network using all financial managers as algorithm 
feed. The nodes (686, 687 and 1369) highlighted in red are financial managers who are also criminals. The 
other criminal who was found is Lea Fastow (11010). 


When comparing the criminals found in the sub-network formed using the financial managers (see Figure 
9) with those found in Figure 8, we see that Lea Fastow (11010) is the only criminal found here other 
than the financial managers who were also criminals. 


5.2. Extracting subgraphs using leave-one-out method 

The leave-one-out method is widely used in various fields of research as a data sampling method for an 
algorithm (Cawley and Talbot 2004; Kocaguneli and Menzies 2013) and can be used to estimate the 
performance of a predictive model (Kocaguneli and Menzies 2013). Past research (Shao, 1996) shows that one 
can set the number of data points to be removed from sample data and use it for validation. This is also 
called delete-p cross validation (Zhang, 1993). 

We name this method leave-Ci-out. Leave-Ci-out refers to dropping one criminal (Ci) from the list 
of criminals and running the shortest paths network search algorithm on the remaining criminals in the 
undirected BCC Netgraph. This method tests the ability of the algorithm to produce sub-networks 
that contain the convicted criminal not included in the algorithm feed. We name the criminal that is left 
out during each iteration Ci. A criminal who has two email accounts has two different BCCNet. IDs, and 
during the leave-Ci-out process both of their IDs are dropped from the algorithm feed. 
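
The leave-Ci-out procedure can be sketched as a small test harness. This is a minimal illustration under stated assumptions: `build_subnetwork` is a hypothetical stand-in that takes the union of breadth-first shortest paths between feed nodes rather than the full SPNSA, and the graph and person labels are toy values. As in the text, all IDs belonging to the left-out person are removed from the feed together.

```python
from collections import deque
from itertools import combinations


def bfs_path(adj, source, target):
    """Return the nodes on one shortest path, or an empty list if none exists."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path
        for nbr in adj.get(node, ()):
            if nbr not in parent:
                parent[nbr] = node
                queue.append(nbr)
    return []


def build_subnetwork(adj, feed):
    """Stand-in for the search algorithm: nodes on shortest paths between feed pairs."""
    nodes = set()
    for s, t in combinations(feed, 2):
        nodes.update(bfs_path(adj, s, t))
    return nodes


def leave_one_out(adj, people):
    """For each person, drop all their IDs from the feed and test whether they reappear."""
    results = {}
    for name, ids in people.items():
        feed = [i for other, other_ids in people.items() if other != name for i in other_ids]
        found = build_subnetwork(adj, feed)
        results[name] = any(i in found for i in ids)
    return results


# Toy undirected graph: a chain 1-2-3-4 plus an isolated node 5.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3], 5: []}
people = {"P": [1], "Q": [2], "R": [3], "S": [4, 5]}
results = leave_one_out(adj, people)
print(results)
```

Here Q and R are recovered because they lie on shortest paths between the remaining feed nodes, while P and S sit at the ends of the chain and are not.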

The results of the leave-Ci-out method are given in Table 9. In 5 out of 9 cases the criminal who is 
left out occurs in the network formed by the shortest paths algorithm. For example, we leave out Michael 
Kopper (12708) from the feed (list of criminals) and run the algorithm. The result obtained is a sub-network 
that contains Michael Kopper and his connections (see Figure 10). 



Figure 10: Enron undirected BCC Netgraph shortest paths network when Michael Kopper (12708) is left 
out from the criminal feed list. The nodes highlighted in red are all criminals. 

Table 9: Leave-Ci-out Method 

Ci out            BCCNet. ID      Ci occurs
Andrew Fastow     686, 687        ✗
Lea Fastow        11010, 11009    ✓
Kevin Hannon      10068           ✗
Kenneth Rice      9994            ✗
Rex Shelby        15224, 15225    ✓
Michael Kopper    12708           ✓
Ben Glisan        1369            ✓
Joe Hirko         8716            ✗
S. Yaeger         861             ✓


Table 9 shows the result of the leave-Ci-out method. It is a test to see whether the criminals that we left out 
still occur in the network formed by the shortest paths algorithm. Ci represents each criminal. The sign ✓ 
indicates that Ci appeared in the shortest paths network community, while ✗ is given to a Ci who does not 
appear in the extracted network. A. Khan (BCCNet. ID 205) was not included in this table because, as shown 
in Figure 7, A. Khan (205) had no connections to other criminals and appeared in a separate component 
containing 3 nodes. 

6. Conclusion 

The work presented in this paper demonstrates the efficacy of our shortest paths network search 
algorithm on a larger dataset and under different search scenarios. The existing community detection 
algorithms in igraph did show the number of communities, but an investigator would need to manually 
check the community to which a criminal belongs. Retrieving the neighbourhood sub-networks of the 
criminals in the community, as identified by the existing community detection algorithms, resulted in dense 
networks, which were hard to visualise and possibly even harder to analyse. The criminals’ connections in 
these sub-networks were hard to view. 

Our shortest paths network search algorithm (SPNSA) clearly shows the criminals’ connections to other 
nodes in all the sub-networks it extracted. Three different investigation methods were tested using the 
SPNSA: when the investigator knows all the criminals, when the investigator fails to detect one of the 
criminals, and when the investigator is at the starting stage and does not have any information about the 
criminals. In all three scenarios, the sub-networks formed by SPNSA were sparse and hence suitable for 
an investigator to see the connections as well as conduct further investigations. The SPNSA was 
able to extract and show the abnormalities through the sub-networks formed: components that contain 
criminals’ connections with other nodes, and the 3-clique component, can be easily detected. 

The SPNSA allows the investigator to feed early suspects or suspicious entities into the criminal 
list of the algorithm, a function that is not available through other community detection algorithms. The 
quality of a criminal investigation can be improved when some inputs can be specified, as in this algorithm; 
SPNSA could be a very useful preliminary investigation tool for an investigator. 